Abstract

Sigmoid type belief networks, a class of probabilistic neural networks, provide a natural framework for 
compactly representing probabilistic information in a variety of unsupervised and supervised learning 
problems. Often the parameters used in these networks need to be learned from examples. Unfortunately, 
estimating the parameters via exact probabilistic calculations (i.e., the EM algorithm) is intractable even
for networks with fairly small numbers of hidden units. We propose to avoid the infeasibility of the E step 
by bounding likelihoods instead of computing them exactly. We introduce extended and complementary 
representations for these networks and show that the estimation of the network parameters can be made 
fast (reduced to quadratic optimization) by performing the estimation in either of the alternative domains. 
The complementary networks can be used for continuous density estimation as well. 
1 Introduction

The appeal of probabilistic networks for knowledge
representation, inference, and learning (Pearl, 1988) derives
both from the sound Bayesian framework and from the
explicit representation of dependencies among the network
variables, which allows ready incorporation of prior
information into the design of the network. The Bayesian
formalism permits full propagation of probabilistic
information across the network regardless of which variables
in the network are instantiated. In this sense these
networks can be “inverted” probabilistically.

This inversion, however, relies heavily on the use of
look-up table representations of conditional probabilities,
or representations equivalent to them, for modeling
dependencies between the variables. For sparse dependency
structures such as trees or chains this poses no
difficulty. In more realistic cases of reasonably
interdependent variables the exact algorithms developed for
these belief networks (Lauritzen & Spiegelhalter, 1988)
become infeasible due to the exponential growth in the
size of the conditional probability tables needed to store
the exact dependencies. Therefore the use of compact
representations to model probabilistic interactions is
unavoidable in large problems. As belief network models
move away from tables, however, the representations can
be harder to assess from expert knowledge and the
important role of learning is further emphasized.

Compact representations of interactions between simple
units have long been emphasized in neural networks.
Lacking a thorough probabilistic interpretation, however,
classical feed-forward neural networks cannot be
inverted in the above sense; e.g., given the output pattern
of a feed-forward neural network it is not feasible
to compute a probability distribution over the possible
input patterns that would have resulted in the observed
output. On the other hand, stochastic neural networks
such as Boltzmann machines admit probabilistic
interpretations and therefore, at least in principle, can be
inverted and used as a basis for inference and learning in
the presence of uncertainty.

Sigmoid belief networks (Neal, 1992) form a subclass
of probabilistic neural networks where the activation
function has a sigmoidal form, usually the logistic
function. Neal (1992) proposed a learning algorithm for these
networks which can be viewed as an improvement of
the algorithm for Boltzmann machines. Recently Hinton
et al. (1995) introduced the wake-sleep algorithm
for layered bi-directional probabilistic networks. This
algorithm relies on forward sampling and has an appealing
coding theoretic motivation. The Helmholtz machine
(Dayan et al., 1995), on the other hand, can be seen
as an alternative technique for these architectures that
avoids Gibbs sampling altogether. Dayan et al. also
introduced the important idea of bounding likelihoods
instead of computing them exactly. Saul et al. (1995)
subsequently derived rigorous mean field bounds for the
likelihoods. In this paper we introduce the idea of
alternative (extended and complementary) representations
of these networks by reinterpreting the nonlinearities in
the activation function. We show that deriving likelihood
bounds in the new representational domains leads
to efficient (quadratic) estimation procedures for the
network parameters.


2 The probability representations 

Belief networks represent the joint probability of a set
of variables $\{S_i\}$ as a product of conditional probabilities
given by

$$P(S_1, \ldots, S_n) = \prod_{k=1}^{n} P(S_k \mid \mathrm{pa}[k]), \tag{1}$$


where the notation $\mathrm{pa}[k]$, “parents of $S_k$”, refers to all
the variables that directly influence the probability of $S_k$
taking on a particular value (for equivalent representations,
see Lauritzen & Spiegelhalter, 1988). The fact that the joint
probability can be written in the above form implies that
there are no “cycles” in the network; i.e., there exists an
ordering of the variables in the network such that no
variable directly influences any preceding variables.

In this paper we consider sigmoid belief networks
where the variables $S$ are binary (0/1), the conditional
probabilities have the form

$$P(S_i \mid \mathrm{pa}[i]) = g\Big( (2S_i - 1) \sum_j W_{ij} S_j \Big) \tag{2}$$


and the weights $W_{ij}$ are zero unless $S_j$ is a parent of
$S_i$, thus preserving the feed-forward directionality of the
network. For notational convenience we have assumed
the existence of a bias variable whose value is clamped
to one. The activation function $g(\cdot)$ is chosen to be the
cumulative Gaussian distribution function given by

$$g(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-z^2/2} \, dz = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} e^{-(z - x)^2/2} \, dz. \tag{3}$$

Although very similar to the standard logistic function,
this activation function derives a number of advantages
from its integral representation. In particular,
we may reinterpret the integration as a marginalization
and thereby obtain alternative representations for the
network. We consider two such representations.
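The integral representation can be checked numerically. The following is a minimal sketch, not part of the original method: it compares the cumulative Gaussian, written with the error function, against the integral over $z \ge 0$ of a unit-variance Gaussian centered at $x$. The function names, the truncation of the infinite upper limit, and the logistic scaling constant 1.702 are assumptions made for illustration only.

```python
import math


def g_cdf(x):
    """Cumulative Gaussian g(x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def g_shifted(x, upper=12.0, steps=20000):
    """Integral over z in [0, inf) of a unit-variance Gaussian centered at x.

    Uses the midpoint rule. The infinite upper limit is truncated at
    max(x, 0) + upper, which is an assumption of this sketch; the neglected
    tail is far below double precision for upper = 12.
    """
    hi = max(x, 0.0) + upper
    dz = hi / steps
    total = 0.0
    for k in range(steps):
        z = (k + 0.5) * dz
        total += math.exp(-0.5 * (z - x) ** 2)
    return total * dz / math.sqrt(2.0 * math.pi)


def logistic(x):
    """Standard logistic function, for comparison only."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (-2.0, 0.0, 1.0):
    # The first two columns should agree closely; the third is a rescaled
    # logistic (scaling 1.702 is a common choice, not taken from the text).
    print(x, g_cdf(x), g_shifted(x), logistic(1.702 * x))
```

The second form is the one that matters later: reading the integration variable as an extra random variable is what turns the integral into a marginalization.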

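We derive an extended representation by making explicit
the nonlinearities in the activation function. More
precisely,

$$\begin{aligned}
P(S_i \mid \mathrm{pa}[i]) &= g\Big( (2S_i - 1) \sum_j W_{ij} S_j \Big) \\
&= \int_0^\infty \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} \left[ Z_i - (2S_i - 1) \sum_j W_{ij} S_j \right]^2} dZ_i \\
&\stackrel{\mathrm{def}}{=} \int_0^\infty P(S_i, Z_i \mid \mathrm{pa}[i]) \, dZ_i.
\end{aligned} \tag{4}$$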


This suggests defining the extended network in terms
of the new conditional probabilities $P(S_i, Z_i \mid \mathrm{pa}[i])$. By
construction then the original binary network is obtained
by marginalizing over the extra variables $Z$. In this sense
the extended network is (marginally) equivalent to the
binary network.

We distinguish a complementary representation from
the extended one by writing the probabilities entirely in
terms of continuous variables.$^1$ Such a representation
can be obtained from the extended network by a simple
transformation of variables. The new continuous variables
are defined by $\tilde{Z}_i = (2S_i - 1) Z_i$, or, equivalently,
by $Z_i = |\tilde{Z}_i|$ and $S_i = \theta(\tilde{Z}_i)$, where $\theta(\cdot)$ is the step
function. Performing this transformation yields

$$P(\tilde{Z}_i \mid \mathrm{pa}[i]) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} \left[ \tilde{Z}_i - \sum_j W_{ij} \theta(\tilde{Z}_j) \right]^2} \tag{5}$$


which defines a network of conditionally Gaussian variables.
The original network in this case can be recovered
by conditional marginalization over $\tilde{Z}$ where the
conditioning variables are $\theta(\tilde{Z})$.
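The relationship between the three representations can be illustrated by sampling a single unit. This is a minimal sketch under stated assumptions: the unit's total input $\sum_j W_{ij} S_j$ is passed in as a plain number, and the helper names `sample_unit` and `empirical_on_probability` are hypothetical. A unit-variance Gaussian is drawn around the total input (complementary variable), its sign gives the binary state, and its magnitude gives the extended variable.

```python
import math
import random


def g(x):
    """Cumulative Gaussian activation function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_unit(total_input, rng):
    """Draw one unit given its total input (hypothetical helper).

    z_tilde is the complementary variable, s = step(z_tilde) is the binary
    state, and z = |z_tilde| is the extra variable of the extended network.
    """
    z_tilde = rng.gauss(total_input, 1.0)
    s = 1 if z_tilde > 0.0 else 0
    z = abs(z_tilde)
    return s, z, z_tilde


def empirical_on_probability(total_input, n_samples=100000, seed=0):
    """Fraction of samples with s = 1; should approach g(total_input)."""
    rng = random.Random(seed)
    on = sum(sample_unit(total_input, rng)[0] for _ in range(n_samples))
    return on / n_samples


for x in (-1.0, 0.3, 1.5):
    print(x, empirical_on_probability(x), g(x))
```

Discarding the continuous variable and keeping only its sign recovers the binary unit, which is the sense in which the representations encode the same distribution over binary states.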

Figure 1 below summarizes the relationships between
the different representations. As will become clear later,
working with the alternative representations instead of
the original binary representation can lead to more flexible
and efficient (least-squares) parameter estimation.



Figure 1: The relationship between the alternative
representations.


3 The learning problem 

We consider the problem of learning the parameters of
the network from instantiations of variables contained
in a training set. Such instantiations, however, need not
be complete; there may be variables that have no value
assignments in the training set as well as variables that
are always instantiated. The tacit division between hidden
(H) and visible (V) variables therefore depends on
the particular training example considered and is not an
intrinsic property of the network.

To learn from these instantiations we adopt the principle
of maximum likelihood to estimate the weights in the
network. In essence, this is a density estimation problem
where the weights are chosen so as to match the
probabilistic behavior of the network with the observed
activities in the training set. Central to this estimation is
the ability to compute likelihoods (or log-likelihoods) for
any (partial) configuration of variables appearing in the
training set. In other words, if we let $X^V$ be the
configuration of visible or instantiated variables$^2$ and $X^H$
denote the hidden or uninstantiated variables, we need
to compute marginal probabilities of the form

$$\log P(X^V) = \log \sum_{X^H} P(X^V, X^H). \tag{6}$$

1 While the binary variables are the outputs of each unit,
the continuous variables pertain to the inputs, hence the
name complementary.

2 To postpone the issue of representation we use $X$ to
denote $S$, $\{S, Z\}$, or $\tilde{Z}$ depending on the particular
representation chosen.

If the training samples are independent, then these log
marginals can be added to give the overall log-likelihood
of the training set

$$\log P(\text{training set}) = \sum_t \log P(X^{V_t}). \tag{7}$$

Unfortunately, computing each of these marginal probabilities
involves summing (integrating) over an exponential
number of different configurations assumed by the
hidden variables in the network. This renders the sum
(integration) intractable in all but a few special cases (e.g.,
trees and chains). It is possible, however, to instead find
a manageable lower bound on the log-likelihood and
optimize the weights in the network so as to maximize this
bound.

To obtain such a lower bound we resort to Jensen’s
inequality:

$$\begin{aligned}
\log P(X^V) &= \log \sum_{X^H} P(X^H, X^V) \\
&= \log \sum_{X^H} Q(X^H) \, \frac{P(X^H, X^V)}{Q(X^H)} \\
&\ge \sum_{X^H} Q(X^H) \log \frac{P(X^H, X^V)}{Q(X^H)}.
\end{aligned} \tag{8}$$


Although this bound holds for all distributions $Q(X^H)$
over the hidden variables, the accuracy of the bound is
determined by how closely $Q$ approximates the posterior
distribution $P(X^H \mid X^V)$ in terms of the Kullback-Leibler
divergence; if the approximation is perfect the divergence
is zero and the inequality is satisfied with equality. Suitable
choices for $Q$ can make the bound both accurate
and easy to compute. The feasibility of finding such $Q$,
however, is highly dependent on the choice of the
representation for the network.
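The role of the Kullback-Leibler divergence can be seen on a toy example. The sketch below is illustrative only: the joint probabilities are made-up numbers for four hidden configurations with the visible variables held fixed, and all function names are hypothetical. It evaluates the exact log marginal, the Jensen bound for a given $Q$, and the divergence between $Q$ and the posterior.

```python
import math


def log_marginal(joint):
    """Exact log P(X^V): log of the sum over hidden configurations."""
    return math.log(sum(joint))


def jensen_bound(joint, q):
    """Sum over hidden configurations of Q * log(P(X^H, X^V) / Q)."""
    return sum(qh * math.log(ph / qh) for ph, qh in zip(joint, q) if qh > 0.0)


def posterior(joint):
    """P(X^H | X^V), obtained by normalizing the joint over X^H."""
    total = sum(joint)
    return [p / total for p in joint]


def kl_divergence(q, p):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p) if qh > 0.0)


# Hypothetical values of P(X^H = h, X^V = v) for four settings of h.
joint = [0.10, 0.02, 0.05, 0.03]
uniform_q = [0.25, 0.25, 0.25, 0.25]

exact = log_marginal(joint)
bound = jensen_bound(joint, uniform_q)
gap = kl_divergence(uniform_q, posterior(joint))

print("exact log-likelihood:", exact)
print("bound with uniform Q:", bound)
print("gap (exact - bound): ", exact - bound, " KL:", gap)
print("bound with Q = posterior:", jensen_bound(joint, posterior(joint)))
```

In this small case the gap between the exact value and the bound equals the divergence term, and it closes when $Q$ is set to the posterior.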


4 Likelihood bounds in different 
representations 

To complete the derivation of the likelihood bound
(equation 8) we need to fix the representation for the
network. Which representation to select, however,
affects the quality and accuracy of the bound. In addition,
the accompanying bound of the chosen representation
implies bounds in the other two representational
domains as they all code the same distributions over the
observables. In this section we illustrate these points
by deriving bounds in the complementary and extended
representations and discuss the corresponding bounds in
the original binary domain.

Now, to obtain a lower bound we need to specify the
approximate posterior $Q$. In the complementary
representation the conditional probabilities are Gaussians
and therefore a reasonable approximation (mean field)
is found by choosing the posterior approximation from
the family of factorized Gaussians:

$$Q(\tilde{Z}) = \prod_i \frac{1}{\sqrt{2\pi}} \, e^{-(\tilde{Z}_i - h_i)^2/2}. \tag{9}$$

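Substituting this into equation 8 we obtain the bound

$$\log P(S^*) \ge -\frac{1}{2} \sum_i \Big( h_i - \sum_j W_{ij} \, g(h_j) \Big)^2 - \frac{1}{2} \sum_{ij} W_{ij}^2 \, g(h_j) \, g(-h_j). \tag{10}$$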

The means $h_i$ for the hidden variables are adjustable
parameters that can be tuned to make the bound as tight
as possible. For the instantiated variables we need to
enforce the constraints $g(h_i) = S_i^*$ to respect the
instantiation. These can be satisfied very accurately by
setting $h_i = 4(2S_i^* - 1)$. A very convenient property
of this bound and the complementary representation in
general is the quadratic weight dependence, a property
very conducive to fast learning. Finally, we note that the
complementary representation transforms the binary
estimation problem into a continuous density estimation
problem.
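The quadratic weight dependence can be made concrete with a short sketch. It assumes the bound takes the form $-\frac{1}{2}\sum_i (h_i - \sum_j W_{ij} g(h_j))^2 - \frac{1}{2}\sum_{ij} W_{ij}^2 g(h_j)(1 - g(h_j))$, which is what Jensen's inequality gives for a factorized unit-variance Gaussian $Q$ with means $h_i$ and unit-variance Gaussian conditionals. The example weights, the means, and the function names are hypothetical, and the Monte Carlo routine is only a cross-check of the closed form, not a learning procedure.

```python
import math
import random


def g(x):
    """Cumulative Gaussian activation function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def complementary_bound(W, h):
    """Closed-form value of the assumed bound for weights W and means h."""
    n = len(h)
    act = [g(hj) for hj in h]  # mean of step(Z_j) under Q
    bound = 0.0
    for i in range(n):
        mean_in = sum(W[i][j] * act[j] for j in range(n))
        var_in = sum(W[i][j] ** 2 * act[j] * (1.0 - act[j]) for j in range(n))
        bound -= 0.5 * (h[i] - mean_in) ** 2 + 0.5 * var_in
    return bound


def monte_carlo_bound(W, h, n_samples=50000, seed=0):
    """Estimate E_Q[log P(Z)] + entropy(Q) by sampling Z ~ Q."""
    rng = random.Random(seed)
    n = len(h)
    log_norm = -0.5 * math.log(2.0 * math.pi)
    entropy = n * 0.5 * (1.0 + math.log(2.0 * math.pi))
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(hi, 1.0) for hi in h]
        s = [1.0 if zi > 0.0 else 0.0 for zi in z]
        for i in range(n):
            m = sum(W[i][j] * s[j] for j in range(n))
            total += log_norm - 0.5 * (z[i] - m) ** 2
    return total / n_samples + entropy


# Hypothetical four-unit network; W[i][j] is nonzero only if j precedes i.
W_EXAMPLE = [
    [0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0],
    [-0.5, 1.2, 0.0, 0.0],
    [0.3, -0.7, 0.9, 0.0],
]
H_EXAMPLE = [0.2, -0.4, 1.0, 0.5]

print("closed form:", complementary_bound(W_EXAMPLE, H_EXAMPLE))
print("monte carlo:", monte_carlo_bound(W_EXAMPLE, H_EXAMPLE))
```

Because the weights enter only through squared terms, maximizing this expression over the weights with the means held fixed is a least-squares problem.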

We now turn to the interpretation of the above bound
in the binary domain. The same bound can be obtained
by first fixing the inputs to all the units to be the means
$h_i$ and then computing the negative total mean squared
error between the fixed inputs and the corresponding
probabilistic inputs propagated from the parents. That
this procedure gives a lower bound on
the log-likelihood would be more difficult to justify by
working with the binary representation alone.

In the extended representation the probability distribution
for $Z_i$ is a truncated Gaussian given $S_i$ and its
parents. We therefore propose the partially factorized
posterior approximation:

$$Q(S, Z) = \prod_i Q(Z_i \mid S_i) \, Q(S_i) \tag{11}$$

where $Q(Z_i \mid S_i)$ is a truncated Gaussian:

$$Q(Z_i \mid S_i) = \frac{1}{g\big((2S_i - 1) h_i\big)} \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} \left( Z_i - (2S_i - 1) h_i \right)^2}. \tag{12}$$

As in the complementary domain the resulting bound 
depends quadratically on the weights. Instead of writing 
out the bound here, however, it is more informative to 
see its derivation in the binary domain. 

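A factorized posterior approximation (mean field)
$Q(S) = \prod_i q_i^{S_i} (1 - q_i)^{1 - S_i}$ for the binary network yields
a bound

$$\begin{aligned}
\log P(S^*) \ge{}& \sum_i \Big\langle S_i \log g\Big( \sum_j W_{ij} S_j \Big) \Big\rangle
+ \sum_i \Big\langle (1 - S_i) \log g\Big( -\sum_j W_{ij} S_j \Big) \Big\rangle \\
&- \sum_i \left[ q_i \log q_i + (1 - q_i) \log(1 - q_i) \right]
\end{aligned} \tag{13}$$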

where the averages $\langle \cdot \rangle$ are with respect to the $Q$
distribution. These averages, however, do not have closed-form
analytical expressions. The tractable posterior approximation
in the extended domain avoids the problem by implicitly
making the following Legendre transformation:

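$$\begin{aligned}
\log g(x) &= \left[ \tfrac{1}{2} x^2 + \log g(x) \right] - \tfrac{1}{2} x^2 \\
&\ge \lambda x - G(\lambda) - \tfrac{1}{2} x^2
\end{aligned} \tag{14}$$

which holds since $x^2/2 + \log g(x)$ is a convex function.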
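Inserting this back into the relevant parts of equation 13,
with separate parameters $\lambda_i$ and $\bar{\lambda}_i$ for the two
logarithmic terms, and performing the averages gives

$$\begin{aligned}
\log P(S^*) \ge{}& \sum_{ij} \left[ q_i \lambda_i - (1 - q_i) \bar{\lambda}_i \right] W_{ij} q_j \\
&- \sum_i \left[ q_i G(\lambda_i) + (1 - q_i) G(\bar{\lambda}_i) \right] \\
&- \frac{1}{2} \sum_i \Big( \sum_j W_{ij} q_j \Big)^2 - \frac{1}{2} \sum_{ij} W_{ij}^2 \, q_j (1 - q_j) \\
&- \sum_i \left[ q_i \log q_i + (1 - q_i) \log(1 - q_i) \right]
\end{aligned} \tag{15}$$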

which is quadratic in the weights as expected. The mean
activities $q$ for the hidden variables and the parameters
$\lambda$ can be optimized to make the bound tight. For the
instantiated variables we set $q_i = S_i^*$.
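The Legendre transformation behind this bound can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: the conjugate function $G(\lambda) = \max_x [\lambda x - x^2/2 - \log g(x)]$ is approximated by a grid search over a finite interval, the function names are hypothetical, and $\lambda$ is restricted to moderate positive values so that the maximizer falls inside the grid.

```python
import math


def log_g(x):
    """log of the cumulative Gaussian, using erfc for accuracy at negative x."""
    return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))


def f(x):
    """The convex function x^2 / 2 + log g(x)."""
    return 0.5 * x * x + log_g(x)


def conjugate_G(lam, lo=-8.0, hi=8.0, steps=16001):
    """Grid approximation of G(lam) = max_x [lam * x - f(x)].

    Returns the approximate maximum and the grid point attaining it.
    The finite grid is an assumption of this sketch.
    """
    best_val, best_x = -math.inf, lo
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        val = lam * x - f(x)
        if val > best_val:
            best_val, best_x = val, x
    return best_val, best_x


def legendre_lower_bound(x, lam, G_lam):
    """Right-hand side of the bound: lam * x - G(lam) - x^2 / 2."""
    return lam * x - G_lam - 0.5 * x * x


lam = 1.0
G_lam, x_star = conjugate_G(lam)
for x in (-2.0, 0.0, x_star, 2.0):
    # The bound is quadratic in x and touches log g(x) at x = x_star.
    print(x, log_g(x), legendre_lower_bound(x, lam, G_lam))
```

For a fixed $\lambda$ the right-hand side is a quadratic function of $x$, which is why the averages over $Q$ become tractable once the transformation is applied; the bound is tight at the point where the maximum defining $G(\lambda)$ is attained.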

5 Numerical experiments 

To test these techniques in practice we applied the
complementary network to the problem of detecting motor
failures from spectra obtained during motor operation
(see Petsche et al., 1995). We cast the problem as a
continuous density estimation problem. The training set
consisted of 800 out of 1283 FFT spectra, each with 319
components, measured from an electric motor in good
operating condition but under varying loads. The test
set included the remaining 483 FFTs from the same motor
in good condition in addition to three sets of 1340
FFTs, each measured when a particular fault was present.
The goal was to use the likelihood of a test FFT with
respect to the estimated density to determine whether
there was a fault present in the motor.

We used a layered 6 → 20 → 319 generative model to
estimate the training set density. The resulting
classification error rates on the test set are shown in figure 2 as a
function of the threshold likelihood. The achieved error
rates are comparable to those of Petsche et al. (1995).
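The thresholding step can be sketched as follows. This is a hypothetical illustration: the log-likelihood values are invented and are not the motor data, and the function name and the convention of flagging a fault when the log-likelihood falls below the threshold are assumptions.

```python
def error_rates(good_loglik, fault_loglik, threshold):
    """Error rates when a fault is flagged for log-likelihood < threshold.

    Returns (false_alarm, missed): the fraction of good examples flagged as
    faulty and the fraction of faulty examples that are not flagged.
    """
    false_alarm = sum(1 for v in good_loglik if v < threshold) / len(good_loglik)
    missed = sum(1 for v in fault_loglik if v >= threshold) / len(fault_loglik)
    return false_alarm, missed


# Invented log-likelihoods, for illustration only.
good = [-10.0, -12.0, -11.0, -30.0]
fault = [-40.0, -35.0, -15.0, -50.0]

for threshold in (-45.0, -20.0, -5.0):
    print(threshold, error_rates(good, fault, threshold))
```

Sweeping the threshold traces out one curve per error type, trading missed faults against false alarms.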

Figure 2: The probability of error curves for missing
a fault (dashed lines) and misclassifying a good motor
(solid line) as a function of the likelihood threshold.

6 Conclusions

Network models that admit probabilistic formulations
derive a number of advantages from probability theory.
Moving away from explicit representations of dependencies,
however, can make these properties harder to exploit
in practice. We showed that an efficient estimation
procedure can be derived for sigmoid belief networks,
where standard methods are intractable in all but a few
special cases (e.g., trees and chains). The efficiency of
our approach derives from the combination of two ideas.
First, we avoided the intractability of computing likelihoods
in these networks by computing lower bounds instead.
Second, we introduced new representations for
these networks and showed how the lower bounds in the
new representational domains transform the parameter
estimation problem into quadratic optimization.

Acknowledgements: 

The authors wish to thank Peter Dayan for helpful
comments on the manuscript.

References 

P. Dayan, G. Hinton, R. Neal, and R. Zemel (1995). The
Helmholtz machine. Neural Computation 7: 889-904.

A. Dempster, N. Laird, and D. Rubin (1977). Maximum
likelihood from incomplete data via the EM algorithm.
J. Roy. Statist. Soc. B 39: 1-38.

G. Hinton, P. Dayan, B. Frey, and R. Neal (1995). The
wake-sleep algorithm for unsupervised neural networks.
Science 268: 1158-1161.

S. L. Lauritzen and D. J. Spiegelhalter (1988). Local
computations with probabilities on graphical structures
and their application to expert systems. J. Roy. Statist.
Soc. B 50: 154-227.

R. Neal (1992). Connectionist learning of belief networks.
Artificial Intelligence 56: 71-113.

J. Pearl (1988). Probabilistic Reasoning in Intelligent
Systems. Morgan Kaufmann: San Mateo.

T. Petsche, A. Marcantonio, C. Darken, S. J. Hanson,
G. M. Kuhn, and I. Santoso (1995). A neural network
autoassociator for induction motor failure prediction. In
Advances in Neural Information Processing Systems 8.
MIT Press.

L. K. Saul, T. Jaakkola, and M. I. Jordan (1995). Mean
field theory for sigmoid belief networks. M.I.T.
Computational Cognitive Science Technical Report 9501.






